Scotomas

Our “lock-outs,” or blind spots, we will refer to in this video curriculum as “scotomas.” It is
not a jargon word. It is a very useful word, because it describes everyone’s blind spots. Whenever
you hear anyone wailing, “I don’t get it,” or “It doesn’t make sense to me,” you can be assured
that person has a scotoma. I want you to understand scotomas. They are very important as you
break out of the limiting and into more valid ways of thinking. Make no mistake: friendships dissolve
because of scotomas, marriages fail because of scotomas; nations, companies, schools, have
scotomas. A scotoma is the block, the blind spot, that keeps us from seeing the truth, the many
optional truths around us.

A scotoma is the sensory locking-out of our environment. We are imprisoned by our own
“blind spots” because of our preconceived way of seeing things, our habitual way of
doing things, our “lock-out” notion of what can be done. A scotoma causes us to see
what we expect to see, hear what we expect to hear and think what we expect
to think. “Oh, we’ve always done it that way.” “He’ll never be able to do
it.” “That company will never buy our product; it never has.”

Ad infinitum. 

So “locking on” and “locking out” create our “blind spots,” our scotomas. I have both bad news
and good news about our blind spots, or scotomas.

The bad news: A great problem with scotomas is that we often don’t know that we have them. We
go about our daily routines, running a business, raising a family, doing our jobs, in a state of
semi-myopia. We don’t see the many optional truths around us.


Do I have certain “habits” at work?

Conversely, do I “lock out” certain
opinions, say, because a woman expresses them?


“The teacher is the pupil.
The pupil is the teacher.”
Analyzing the Performance of Learning Algorithms


David Haussler and Manfred K. Warmuth 
Department of Computer and Information Sciences 
University of California at Santa Cruz, CA 95064 


Abstract. 

We discuss the approach to the analysis of learning algorithms that we have taken in our
laboratory and summarize the results we have obtained in the last few years. We have worked
on refining and generalizing the PAC learning model introduced by Valiant. Measures of
performance for learning algorithms that we have examined include computational complexity,
sample complexity, probability of misclassification (learning curves), and worst case total
number of misclassifications or hypothesis updates. We have looked for theoretically optimal
bounds on these performance measures, and for learning algorithms that achieve these bounds.
Learning problems we have examined include those for decision trees, neural networks, finite
automata, conjunctive concepts on structural domains, and various classes of Boolean
functions. We have also worked on clustering data represented as sequences over a finite
alphabet. Many of the new learning algorithms that we have developed have been tested
empirically as well.


Introduction 

Recent years have brought a significant increase in research activity in the area of machine
learning, both through increased interest among mainstream artificial intelligence researchers
(see e.g. [MCM 83,86]) and through the resurgence of interest in connectionist/neural net
models (see e.g. [RM86] [Hi88]). The researchers in this new field come from a variety of
disciplines, including artificial intelligence, statistical pattern recognition and decision
theory, neurobiology, cognitive science, and the theory of algorithms and computational
complexity. While this confluence of disciplines has stimulated recent progress, it has also led
to a "tower of Babel" problem, in which differences in language and methodology make it
difficult to compare the results obtained. Without attempting to impose a uniform language and
methodology on the field as a whole, it is our feeling that a significant part of the empirical
work, namely the part that has been called learning from examples [CF82], can be treated in a
clear, quantitative manner that is useful to the practitioner and is based on solid theoretical
foundations. In this paper we summarize some of the work that we have done in our laboratory in
this direction.

Of the fields mentioned above, our approach has the strongest kinship with the recent
learning research by those trained in the theory of algorithms and computational complexity.
Thus we view a learning strategy as an algorithm, and apply the techniques and perspectives
from this field to analyze its capabilities and performance. Early efforts along these lines were
based primarily on the inductive inference model introduced by Gold [G67]. However, most
work in this model has not placed sufficient emphasis on minimizing the resources required by
learning algorithms to be of much use in actual learning applications. More recently, Valiant
[V84] [V85] [PV88] [KLPV87] has introduced a probabilistic model for the study of learning
algorithms which is often called the PAC model (for Probably Approximately Correct
learning) [Ang88]. This model has been more successful in addressing some of the
requirements that are typically placed on a learning algorithm in practice.

In most formal models of learning from examples, the task is to identify an unknown
target function f based on examples of that function (i.e. pairs of the form (x, f(x))). To allow
greater emphasis on the computational efficiency of the learning algorithm, the PAC model
requires only that a good approximation to the target function be found with high probability,
rather than requiring exact identification of the target function. We elaborate on this model in
the following section. Recent results have demonstrated that there are efficient learning
algorithms that achieve this type of probably approximately correct learning for many types of
target functions [KLPV87], [BEHW89], [H88].

In our work we have concentrated on refining and generalizing the PAC model. We have
identified a number of ways that the performance of learning algorithms can be quantified in
terms of the amount of resources that they consume in order to achieve a given level of
performance. The most important resources are computation time and space, and the number of
training examples used. We have tried to derive theoretical bounds that indicate the optimum
performance that can be expected of any learning algorithm given particular resource
constraints, and to develop algorithms that approach this optimum. We have looked for trade-offs
between resources, and for general algorithm transformations that can improve the way a
learning algorithm utilizes one resource without seriously degrading its utilization of others.
We also look for algorithm transformations that make learning algorithms more robust to noise
and other types of anomalies in the training data.

In addition to our theoretical work, we have also experimented with a number of the
learning algorithms that we have developed. We have found experimental evaluation to be an
important counterpart to theoretical analysis. Experiments can only estimate a learning
algorithm’s performance on particular distributions of training examples, whereas the
theoretical bounds in the PAC model hold for any distribution. Nevertheless, experiments can
sometimes provide a good indication of typical performance when theoretical analysis is
intractable, and can also indicate when theoretical worst case upper bounds are overly
pessimistic in practice.


Definitions 

There are many ways that learning performance can be measured. In any given
application, the appropriate metrics will depend largely on how the goal of learning is defined,
and on what resources are deemed critical. In the PAC model, the goal of learning is to produce a
good approximation to an unknown target function. For target functions that take on only two
possible values (usually called concepts), this can be made precise as follows.

We assume that the target concept f is a {0,1}-valued function on a given domain X
(called the instance space). Typically f is a Boolean function, i.e. the instance space X is
defined by some set of Boolean attributes. Random examples of the target concept f are
generated by drawing instances independently at random from the instance space X, and for
each instance x, forming the pair (x, f(x)). It is assumed that instances are drawn according to
a fixed probability distribution P on the instance space X. We make no assumptions about the
probability distribution by which the instances are selected, other than that it remains fixed and
that the instances are chosen independently. In particular, the distribution is not assumed to be
known to the learning algorithm.

From these random examples of the target concept f, a learning algorithm produces a
hypothesis h, which is also a function from the instance space X to {0,1}. The accuracy of this
hypothesis h is the probability that it will agree with the target concept f on a randomly drawn
instance, i.e.

accuracy(h) = P({x ∈ X : h(x) = f(x)}).

A good approximation to the target concept f is a hypothesis h with high accuracy.
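
As an illustration of this definition, the accuracy of a hypothesis can be estimated
empirically by drawing instances from P and counting agreements. The following is a minimal
sketch only: the 3-bit uniform instance space, the target x0 AND x1, and the hypothesis x0 are
hypothetical choices made for the example, not taken from the work described here.

```python
import random

def estimate_accuracy(h, f, draw_instance, n_samples=10_000, seed=0):
    """Monte Carlo estimate of P({x : h(x) == f(x)}) under the sampling distribution."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_samples):
        x = draw_instance(rng)          # one instance drawn according to P
        agree += (h(x) == f(x))         # True counts as 1
    return agree / n_samples

# Hypothetical distribution P: uniform over 3-bit Boolean instances.
def draw_uniform_bits(rng, n=3):
    return tuple(rng.randint(0, 1) for _ in range(n))

target = lambda x: int(x[0] and x[1])   # assumed target concept f
hypothesis = lambda x: int(x[0])        # assumed hypothesis h

acc = estimate_accuracy(hypothesis, target, draw_uniform_bits)
```

Here h and f disagree exactly when x0 = 1 and x1 = 0, which has probability 1/4 under the
uniform distribution, so the estimate should be close to 0.75.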

Notice that the accuracy of the hypothesis is measured with respect to the same
probability distribution that is used to generate the training examples. This is an essential
part of the model. Under this assumption, it has been shown that many elementary learning
strategies are guaranteed to produce a hypothesis with high accuracy with high probability,
regardless of the underlying distribution P on the instances [V85],[H88],[BEHW89].

For PAC learning, the most important measures of performance are sample complexity
and time complexity. The time complexity of a learning algorithm refers to the computational
time it takes to produce a hypothesis from a given sequence of examples. The sample
complexity of a learning algorithm is defined in terms of the number of random examples needed
so that the hypothesis produced has high accuracy with high probability. It is a function that
explicitly depends on the accuracy demanded of the hypothesis, and the confidence with which
that accuracy is achieved. In general, the sample complexity will depend on the probability
distribution that governs the generation of examples. In the PAC model, unless otherwise
specified, sample complexity refers to the worst case sample complexity over any distribution
that may be used to generate the training examples. Note that it is typically difficult to know
in experimental learning studies just what distributions will prove the hardest for an algorithm
to handle, or to know just what distributions will be encountered in particular application
environments. Thus the guarantees supplied by this model on the performance of an algorithm for
arbitrary distributions are of significant practical importance.
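
To make the dependence on accuracy and confidence concrete, the sketch below evaluates one
standard distribution-free bound. It assumes a finite hypothesis class H and a learner that
always outputs a hypothesis consistent with the sample; under those assumptions,
m >= (1/epsilon)(ln|H| + ln(1/delta)) examples suffice for accuracy at least 1 - epsilon with
probability at least 1 - delta. The function name and the example class are illustrative
choices, not notation from this report.

```python
import math

def sample_bound_finite_class(num_hypotheses, epsilon, delta):
    """Sufficient sample size for a consistent learner over a finite class.

    epsilon: allowed error (1 - accuracy); delta: allowed failure probability.
    """
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / epsilon)

# Assumed example: monotone conjunctions over 10 Boolean attributes (2**10 hypotheses).
m = sample_bound_finite_class(2 ** 10, epsilon=0.1, delta=0.05)
```

The bound grows linearly in 1/epsilon but only logarithmically in 1/delta and in the size of
the class, which is why high confidence is comparatively cheap.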

For incremental learning algorithms, i.e. algorithms that process examples one at a time
and update a current hypothesis after each new example, other performance measures are
relevant. One is space efficiency, defined in terms of the amount of memory space used to
keep the current hypothesis and other data between examples. Another is update time, i.e. the
time it takes to update the current hypothesis given a new example.

Incremental learning algorithms are usually used in settings where the current hypothesis
is used to make a prediction of the value of the target concept on a given instance x, and the
algorithm is subsequently told whether or not that prediction was correct. This type of
interaction occurs whenever the module that incorporates the learning procedure is required to
do useful work while it is learning, e.g. in robotics applications. We call this an on-line
learning setting.

The on-line learning setting is also associated with another useful learning measure, the
number of mistakes during learning. A mistake occurs whenever the prediction of the learning
algorithm is incorrect. Since most on-line learning algorithms only update their hypotheses
when a mistake is made, this measure is usually the same as the number of hypothesis updates
or mind changes during learning. The classical perceptron convergence theorem gives a
worst case bound on this number for the perceptron learning algorithm applied to the target
class of linear threshold concepts (see e.g. [DH73]). This is not a probabilistic bound. No
assumptions are made on the order in which instances are given. Our colleague Nick
Littlestone has developed a new variant of the perceptron algorithm that gives better bounds in
many important cases, and examined the relationship between this type of "mistake" bound and
the bounds on sample complexity in PAC learning [Li87,89],[HKLW88].
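
The mistake-counting view can be sketched with the classical (additive) perceptron rule run
on-line. This is a minimal illustration: the 0/1 label encoding, the bias term, the cap on the
number of passes, and the two-input OR data are assumptions of the example rather than details
from the cited work.

```python
def predict(w, b, x):
    """Linear threshold prediction in {0, 1}."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def perceptron_online(examples, n_features, max_passes=100):
    """Run the additive perceptron rule, updating only on mistakes.

    Returns the final weights, bias, and the total number of mistakes
    (equal here to the number of hypothesis updates).
    """
    w, b, mistakes = [0.0] * n_features, 0.0, 0
    for _ in range(max_passes):
        mistakes_this_pass = 0
        for x, y in examples:
            if predict(w, b, x) != y:
                mistakes += 1
                mistakes_this_pass += 1
                sign = 1 if y == 1 else -1           # move toward the correct side
                w = [wi + sign * xi for wi, xi in zip(w, x)]
                b += sign
        if mistakes_this_pass == 0:                  # a clean pass: hypothesis is consistent
            break
    return w, b, mistakes

# Assumed toy target: Boolean OR of two attributes (linearly separable).
OR_EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b, mistakes = perceptron_online(OR_EXAMPLES, n_features=2)
```

Because updates happen only on mistakes, the returned count is both the number of wrong
predictions and the number of mind changes.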

A probabilistic performance measure can also be defined for the on-line setting. It is
simply the probability of making a mistake on the t-th instance. Here we assume that the
instances are drawn independently at random from some fixed distribution on the instance space,
as in the standard PAC model. Since in the on-line model learning occurs after each instance, we
expect the probability of making a mistake to go down as the instance number t increases.
Plotting this probability of a mistake as a function of t gives what is typically called a
learning curve. In [HLW88] we explore the close relationship between these learning curves and
the sample complexity bounds in PAC learning.
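
A learning curve of this kind can be estimated by simulation: run an on-line learner many
times on independent random sequences and record how often it errs at each trial t. The sketch
below is only an illustration. The learner (a simple elimination rule for monotone
conjunctions), the target x0 AND x1 over 8 uniformly random bits, and the run counts are all
assumptions chosen for the example.

```python
import random

def run_once(rng, n=8, target=(0, 1), trials=60):
    """One on-line run; returns a 0/1 mistake indicator for each trial."""
    relevant = set(range(n))            # hypothesis: conjunction of all attributes
    mistakes = []
    for _ in range(trials):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = int(all(x[i] for i in target))
        pred = int(all(x[i] for i in relevant))
        mistakes.append(int(pred != y))
        if y == 1 and pred == 0:
            # Drop attributes that are 0 in a positive example.
            relevant = {i for i in relevant if x[i] == 1}
    return mistakes

def learning_curve(runs=2000, trials=60, seed=0):
    """Estimated probability of a mistake at each trial t, averaged over runs."""
    rng = random.Random(seed)
    totals = [0] * trials
    for _ in range(runs):
        for t, m in enumerate(run_once(rng, trials=trials)):
            totals[t] += m
    return [c / runs for c in totals]

curve = learning_curve()
```

Plotting `curve` against t gives the estimated learning curve for this particular learner and
distribution; it says nothing about other distributions.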


Results 

In this section we briefly discuss the results we have obtained by applying the
methodology outlined in the introduction to the performance measures defined above. These
results were obtained in collaboration with many of our colleagues, including Eric Baum, Anselm
Blumer, Andrzej Ehrenfeucht, David Helmbold, Michael Kearns, Lenny Pitt, Bob Sloan, Les Valiant,
and Emo Welzl, and our PhD students Nick Littlestone (now at Aiken Computing Lab., Harvard),
Aleksandar Milosavljevic and Giulia Pagallo. Although it represents only a small fraction of
the recent work in this area, due to space limitations we restrict ourselves in this report
to work that was done at least partly in our laboratory. For a more complete picture of recent
work in this area, we refer the reader to the proceedings of the two workshops on computational
learning theory [HP89] [R89]. Our main results are as follows.

(1) The original PAG learning model was defined only for target concepts on Boolean instance 
spaces. We have extended this model to AI instance spaces that include real-valued attributes 
[BEHW89], multi-valued attributes with hierarchical value structure [H88], and structured, 
multi-object instances (e.g. blocks-world scenes) [H89c]. 

We have also generalized the model so that it can be used to analyze algorithms that learn
multi-valued functions including real and vector-valued functions, as well as functions that take
values in a finite or countably infinite set [H89a,b]. Here the accuracy of a hypothesis is
defined as the average distance between the value that it predicts on a given random instance
and the value that is observed for that instance. Thus a hypothesis with high accuracy is one
that predicts values that are close to those actually observed, like a good scientific theory.
There is a great deal of flexibility in how the distance between predicted and observed values
can be defined, so a wide variety of performance measures can be cast in this framework,
including those commonly used in statistical pattern recognition and neural net research (e.g.
mean squared error). The generalized model also handles a wide variety of "noise"
processes, so that one does not need to assume that the training data is generated precisely
according to some underlying target function. This is seldom a realistic assumption when the
data is real-valued.
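
The role of the distance function can be sketched as follows. The code computes the average
distance on a finite set of observations, which is only an empirical stand-in for the
expectation over random instances; the three distance functions and the small data set are
hypothetical examples.

```python
def empirical_loss(h, observations, distance):
    """Average distance between predicted and observed values on a sample."""
    return sum(distance(h(x), y) for x, y in observations) / len(observations)

# Interchangeable distance measures (assumed examples).
squared = lambda p, y: (p - y) ** 2      # gives mean squared error
absolute = lambda p, y: abs(p - y)       # gives mean absolute error
zero_one = lambda p, y: float(p != y)    # gives misclassification rate

# Hypothetical noisy real-valued observations (x, observed value).
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]
h = lambda x: x                          # assumed hypothesis: the identity

mse = empirical_loss(h, data, squared)
mae = empirical_loss(h, data, absolute)
```

Swapping the `distance` argument changes the performance measure without changing the
learner or the data, which is the flexibility described above.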

(2) For concept learning, we have provided a general combinatorial characterization of PAC
learnability using the Vapnik-Chervonenkis (VC) dimension [VC71], [Vap82], [HW87],
[BEHW89]. This has led to the discovery of necessary and sufficient conditions on a class of
target concepts for the existence of PAC learning algorithms with polynomial sample and time
complexity, and demonstrations of such algorithms for several types of concept classes
[BEHW89].
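
For readers unfamiliar with the VC dimension: it is the size of the largest set of instances
that the concept class can label in all possible ways (such a set is said to be shattered). On
a small finite domain it can be computed by brute force, as in the sketch below. The domain
and the two concept classes (intervals and thresholds on six points) are illustrative
assumptions, and the exhaustive search is exponential and meant only for tiny examples.

```python
from itertools import combinations

def is_shattered(points, concepts):
    """True if the concepts realize all 2^|points| labelings of `points`."""
    labelings = {tuple(c(x) for x in points) for c in concepts}
    return len(labelings) == 2 ** len(points)

def vc_dimension(domain, concepts):
    """Largest k such that some k-subset of `domain` is shattered."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(s, concepts) for s in combinations(domain, k)):
            d = k
        else:
            break   # if no k-subset is shattered, no larger subset can be
    return d

domain = list(range(6))
# Intervals [a, b]; pairs with a > b give the empty concept.
intervals = [lambda x, a=a, b=b: int(a <= x <= b) for a in domain for b in domain]
# Thresholds: x is positive iff x >= a.
thresholds = [lambda x, a=a: int(x >= a) for a in range(7)]

d_intervals = vc_dimension(domain, intervals)
d_thresholds = vc_dimension(domain, thresholds)
```

Intervals shatter any two points but cannot label three points as positive, negative,
positive, so their VC dimension is 2; thresholds have VC dimension 1.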

Using further work of Vapnik and Chervonenkis [Vap82], and also work of Dudley
[Du84] and Pollard [Po84], we have obtained related results for learning real and vector-valued
functions [H89a,b]. However, here we have only sufficient conditions. Using these results, we
have obtained upper bounds on the number of training examples needed for learning with
feedforward neural networks and radial basis functions [BH89] [H89a,b]. As above, here we do not
need to assume there is an underlying target function.

(3) The VC dimension has also proven useful in obtaining lower bounds for sample complexity.
General sample complexity lower bounds for learning target concepts given by k-DNF and
k-CNF Boolean expressions, symmetric functions, decision lists [Riv87] and linear threshold
functions are given in [EHKV89]. These results show that a number of important learning
algorithms have sample complexity within either a constant or logarithmic factor of optimal.
These algorithms include the classical AI algorithm for conjunctive concepts on instance spaces
with tree-structured and linear attributes, and greedy variants of this algorithm for k-DNF,
k-CNF and internal disjunctive concepts [H88].

(4) We have demonstrated that the principle of preferring the simpler hypothesis, usually called 
Occam’s Razor, leads to provably good learning performance [BEHW87]. We have explored 
techniques for efficiently implementing this heuristic using a greedy algorithm [BEHW89], 
[H88], [H89c]. 

(5) We have shown that the target concept classes that have PAC learning algorithms with
polynomial sample and time complexity are the same for nearly all variants of the PAC
learning model that have been considered by various authors [HKLW88].

(6) For many natural concept classes no efficient learning algorithm is known. We suspect that
many of these are in fact difficult to learn, but we currently lack techniques to settle the
question. To gain a clearer picture, we have begun to develop a theory along the lines of the
theory of NP-completeness that allows us to compare the difficulty of learning various classes.
We have defined a notion of reducibility that preserves efficient learning. With respect to this
type of reduction we have shown certain learning problems to be complete for standard complexity
classes such as LOGSPACE, LOGCFL, and P.

As in the theory of NP-completeness, when a problem is complete for its class, then
finding an efficient learning algorithm for it implies that an efficient algorithm exists for every
learning problem in its complexity class. For the richer complexity classes, this strong
implication provides evidence that efficient learning algorithms may not exist for these
complete problems. We use completeness proofs to establish hardness results for learning various
natural target classes [PW88]. Very important recent work of Kearns and Valiant also uses our
notion of reduction to show that certain fundamental learning problems, including the problem of
learning finite automata, are intractable, based on cryptographic assumptions [KV89]. We also
give related negative results for the problem of learning finite automata, based only on the
weaker assumption that P ≠ NP [PW89].

(7) Nick Littlestone has developed a theory of optimal mistake-bounded algorithms for the
on-line setting. These algorithms make the minimal total number of mistakes of prediction during
learning for any target concept in a given target class and any sequence of instances from the
instance space. He gives general constructions for optimal mistake-bounded algorithms that
work for any concept class, and in a few cases yield computationally efficient algorithms
[Li87]. He has also extended these results to the probabilistic setting in which the probability of
a mistake on the t-th instance is the primary performance measure [HLW88]. Here again, new
combinatorial techniques using the Vapnik-Chervonenkis dimension play an important role.

(8) One of the computationally efficient learning algorithms that Littlestone has developed is a
variant of the classical perceptron learning algorithm that has a significantly better mistake
bound for many important concept classes [Li87]. For disjunctions and conjunctions, he shows
that its mistake bound is within a constant factor of optimal. The algorithm converges more
rapidly by making small multiplicative changes to the individual weights during learning,
instead of additive changes. Although the basic learning algorithm can learn only linearly
separable functions, by certain transformations of the training examples it can be made to learn
other types of target concepts [Li87]. In particular, it can be applied to learn k-DNF concepts
with a mistake bound that is within a constant factor of optimal. This is a significant
improvement over previous incremental algorithms for this target class [V85].
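
The difference between multiplicative and additive updates can be sketched as follows. This
is a simplified illustration in the spirit of the algorithm in [Li87], not a transcription of
it: the initial weights of 1, the threshold n/2, the update factor 2, and the random
disjunction task are all assumptions of the example.

```python
import random

def multiplicative_online(examples, n, alpha=2.0):
    """On-line linear threshold learner with multiplicative weight updates.

    Weights of active attributes (x_i = 1) are multiplied by alpha after a
    false negative and divided by alpha after a false positive.
    """
    theta = n / 2.0                     # assumed threshold
    w = [1.0] * n                       # assumed initial weights
    mistakes = 0
    for x, y in examples:
        total = sum(wi for wi, xi in zip(w, x) if xi)
        pred = 1 if total > theta else 0
        if pred == y:
            continue                    # update only on mistakes
        mistakes += 1
        factor = alpha if y == 1 else 1.0 / alpha
        w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
    return w, mistakes

# Assumed task: the disjunction x0 OR x3 over 16 uniformly random Boolean attributes.
rng = random.Random(0)
n, relevant = 16, (0, 3)
stream = []
for _ in range(2000):
    x = [rng.randint(0, 1) for _ in range(n)]
    stream.append((x, int(any(x[i] for i in relevant))))

weights, mistakes = multiplicative_online(stream, n)
```

With these particular settings, a short potential argument limits the mistakes on a
k-literal monotone disjunction to fewer than 4 + 3k log2(n), i.e. at most 28 for this stream,
regardless of the order of the instances. The count grows with log n rather than n, so
irrelevant attributes are cheap.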

(9) A variant of Littlestone’s algorithm called the weighted majority algorithm can be used
more generally as a method of combining several learning algorithms into a single learning
algorithm that is more powerful and more robust than any of the component algorithms
[LW89]. In this scheme all of the component learning algorithms are run in parallel on the
same training instances. For each instance, each algorithm makes a prediction and then these
predictions are combined by a weighted voting scheme to determine the overall prediction by
the "master" algorithm. After receiving feedback on its prediction, the master algorithm adjusts
the weights for each of the component algorithms, increasing the weights of those that made the
correct prediction, and decreasing the weights of those that predicted incorrectly. As in
Littlestone’s basic algorithm, these weight changes are multiplicative, and hence are different
from the type of additive changes that one gets by applying the usual gradient descent
techniques used in connectionist learning algorithms. With respect to the mistake bound
performance measure, we have shown that this method of combining learning algorithms by weighted
voting can lead to algorithms that are nearly optimally robust with respect to anomalies in the
training data, and that the performance of the master algorithm approaches the performance of
the best component algorithm for any given learning task.
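
The voting scheme can be sketched in a few lines. Two simplifications are assumed here: the
components are fixed predictors rather than learning algorithms, and only the weights of
wrong components are changed (multiplied by a factor beta < 1), which has the same effect on
the vote as also raising the weights of correct ones, since only relative weights matter. The
component predictors and the target are hypothetical.

```python
from itertools import product

def weighted_majority(experts, examples, beta=0.5):
    """Combine 0/1 predictors by weighted vote with multiplicative penalties.

    Returns the master's mistake count, each component's mistake count,
    and the final weights.
    """
    weights = [1.0] * len(experts)
    master_mistakes = 0
    expert_mistakes = [0] * len(experts)
    for x, y in examples:
        votes = [e(x) for e in experts]
        w1 = sum(w for w, v in zip(weights, votes) if v == 1)
        w0 = sum(w for w, v in zip(weights, votes) if v == 0)
        pred = 1 if w1 >= w0 else 0              # weighted vote, ties go to 1
        master_mistakes += (pred != y)
        for i, v in enumerate(votes):
            if v != y:                           # penalize every wrong component
                expert_mistakes[i] += 1
                weights[i] *= beta
    return master_mistakes, expert_mistakes, weights

# Hypothetical component predictors on 2-bit instances.
experts = [
    lambda x: 0,
    lambda x: 1,
    lambda x: x[0],
    lambda x: x[1],
    lambda x: x[0] ^ x[1],
]
examples = [(x, x[0]) for x in product((0, 1), repeat=2)] * 5   # assumed target: x0

master, per_expert, final_weights = weighted_majority(experts, examples)
```

With beta = 1/2 the master makes at most about 2.41 (m + log2 N) mistakes, where N is the
number of components and m is the mistake count of the best one; here m = 0 and N = 5, so the
master errs at most 5 times.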

(10) We have developed learning algorithms that apply to concept classes that include nested
exceptions. These are concepts like "when to use form 1040X for your income tax" that include
conditions like "use 1040X if you are married and have combined income greater than y unless
you are over 65 and renting, except if you are also blind." Here there is a basic rule, then
exceptions to the rule, and then exceptions to the exceptions, etc. We have shown that under
certain conditions, individual learning algorithms for each of the different types of rules can
be combined into a master algorithm that learns rules with nested exceptions [HSW89]. We show
that this algorithm is optimal or nearly optimal with respect to a variety of learning measures.
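
To show what a concept with nested exceptions looks like as a data structure, the sketch
below encodes the 1040X example and evaluates it. It illustrates only the target
representation, not the learning algorithm of [HSW89]; the `Rule` class, the field names, and
the income threshold standing in for y are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Rule:
    condition: Callable[[dict], bool]   # when this rule applies
    value: bool                         # its verdict if no exception applies
    exceptions: List["Rule"] = field(default_factory=list)

def evaluate(rule, case, default=False):
    """Apply a rule, letting the first matching exception override it."""
    if not rule.condition(case):
        return default
    for exc in rule.exceptions:
        if exc.condition(case):
            return evaluate(exc, case, default)
    return rule.value

INCOME_THRESHOLD = 50_000   # hypothetical stand-in for the unspecified amount y

use_1040x = Rule(
    condition=lambda c: c["married"] and c["income"] > INCOME_THRESHOLD,
    value=True,
    exceptions=[
        Rule(
            condition=lambda c: c["age"] > 65 and c["renting"],
            value=False,                                      # exception to the rule
            exceptions=[Rule(lambda c: c["blind"], True)],    # exception to the exception
        )
    ],
)

base = {"married": True, "income": 60_000, "age": 40, "renting": False, "blind": False}
older_renter = dict(base, age=70, renting=True)
older_blind_renter = dict(older_renter, blind=True)
```

Each level of nesting flips the verdict of the level above it, which is the structure a
master algorithm for this class has to recover from examples.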

(11) In terms of experimental work, Giulia Pagallo has looked at the problem of learning 
Boolean functions with a short Disjunctive Normal Form representation using hypotheses that 
are decision trees [PH89a]. The novel aspect of her work is that her learning algorithm invents 
new attributes while it is learning, and uses these attributes to redescribe the training data at a 
higher level in order to facilitate the learning task (see also [Sch87]). 

The algorithm, called FRINGE, begins by building a decision tree from the training
examples in the standard way (as in Quinlan’s ID3 method [Q86], see also [BFOS84]). Then it
uses this decision tree to find new higher level attributes for the training examples, reexpresses
the training examples using these attributes, and then repeats the process, building a decision
tree from the modified training examples. She has demonstrated empirically that this algorithm
outperforms the standard decision tree algorithm for learning short DNF formulae when the
examples are drawn at random from the uniform distribution. Based on the work in [Riv87],
she has also developed related learning algorithms for this task that use greedy methods and are
somewhat more amenable to analysis [PH89b]. Further theoretical work on learning decision
trees and DNF formulae is given in [EH89].

(12) In other experimental work, Aleksandar Milosavljevic has developed a clustering
algorithm for data that consists of aligned sequences of letters over a finite alphabet [MHJ89].
This algorithm is intended for use in biology, where the sequences of letters represent the
chains of nucleic acids that form similar pieces of DNA, or chains of amino acids that form
similar proteins. However, the principle is quite general, and may also be applicable to other
types of sequences, such as those found in speech recognition and OCR applications.

The algorithm, called MASC for Multiple Aligned Sequence Classifier, is based on the
minimum description length principle [Ris78], in which the classification that is preferred is the
one that gives the most compact description of the data. This method can be justified as an
application of Bayes’ rule. The algorithm has been very successful in producing classifications
that are acceptable to biologists, and is now being used as a research tool to provide initial
hypotheses for new types of sequences that are being collected.


Conclusion 

As mentioned above, our work represents only a small fraction of the work that has been
done on the analysis of learning algorithms in the last few years. The introduction of the PAC
model has been a major stimulus of new research in this area. Yet it is not the only approach
that has been explored. We and others have looked at non-probabilistic mistake bound models,
as well as Bayesian models such as the minimum description length principle.

The field is still young. New perspectives are emerging at a healthy rate. We don’t expect
the pace to slow, nor do we anticipate a quick resolution of the fundamental problems.
However, we are pleased with the progress that has been made.